In this project, I will select map data for an area from OpenStreetMap and use data wrangling techniques to assess the validity, accuracy, completeness, consistency, and uniformity of the data. I will then clean the dataset, load it into MongoDB, and run my own analysis to find some interesting insights.
The area I chose is Kenmore, Washington, the city I live in. I downloaded the dataset from OpenStreetMap via the Overpass API. I will apply the techniques learned in Udacity's Data Wrangling with MongoDB course to explore, audit, and clean this dataset, then convert the XML to JSON.
from IPython.display import Image
Image("kenmore.png", width ="600")
run step1_download_data.py
Before adding the data to MongoDB, we need to check whether there are any data quality issues. I will parse the file with ElementTree, starting with a count of the different elements.
run step2_auditingdata.py, get
{'bounds': 1, 'member': 8145, 'meta': 1, 'nd': 341142, 'node': 303563, 'note': 1, 'osm': 1, 'relation': 325, 'tag': 194947, 'way': 33921}
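The element count above can be reproduced with a short ElementTree pass. This is a minimal sketch (the actual step2_auditingdata.py may differ), using `iterparse` so the whole file never has to sit in memory:

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def count_tags(filename):
    """Count how many times each element tag occurs in an OSM XML file."""
    counts = defaultdict(int)
    # iterparse streams the file, so large OSM extracts do not exhaust memory
    for _, elem in ET.iterparse(filename, events=("start",)):
        counts[elem.tag] += 1
    return dict(counts)
```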
Since the tag keys are what interest me most, I then looped through all tag keys and printed their counts.
run step2_auditingdata.py, get
[('addr:housenumber', 69105), ('addr:street', 69060), ('addr:postcode', 68970), ('addr:city', 68868), ('building', 68445), ('source', 40278), ('highway', 32385), ('name', 19917), ('addr:unit', 12804), ('tiger:county', 12669), ('tiger:cfcc', 12597), ('tiger:name_base', 12009), ('tiger:reviewed', 8835), ('tiger:name_direction_suffix', 6696), ('service', 5166), ('tiger:name_direction_prefix', 4854), ('power', 4614), ('lanes', 4398), ('maxspeed', 3792), ('tiger:name_type', 3405), ('amenity', 2925), ('ref', 2664), ('oneway', 2577), ...]
Finally, I also audited whether any tag keys contain characters that are not valid as MongoDB keys. I printed the number of keys of each type and found no problematic keys.
run step2_auditingdata.py, get
{'lower': 72762, 'lower_colon': 120222, 'other': 1963, 'problemchars': 0}
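The four buckets above come from classifying each key with regular expressions. A sketch, assuming the same patterns used in the Udacity course (all-lowercase keys, lowercase keys with a single colon, keys containing characters that cause problems in MongoDB, and everything else):

```python
import re

# Regexes mirroring the course's key-type audit (assumed patterns)
lower = re.compile(r'^([a-z]|_)*$')
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

def key_type(key, keys):
    """Bucket a tag key as lower, lower_colon, problemchars, or other."""
    if lower.search(key):
        keys['lower'] += 1
    elif lower_colon.search(key):
        keys['lower_colon'] += 1
    elif problemchars.search(key):
        keys['problemchars'] += 1
    else:
        keys['other'] += 1
    return keys
```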
First, I found inconsistencies in the street names: some use abbreviations while others do not. I updated the abbreviations to the full names. Here are two examples; for names ending with a unit number, I moved the number to the front:
15th Ave Northeast => 15th Avenue Northeast
Lake City Way Northeast #183 => No.183 Lake City Way Northeast
The process is as follows:
1. Create a regex to find abbreviations, along with a list of expected full names and a mapping from each abbreviation to its full name.
2. Audit each <tag> value whose key equals "addr:street"; if the value does not match the expected pattern, update it using the mapping dictionary.
3. Save the file with the updated street names.
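The audit-and-update logic can be sketched as follows. This is a minimal sketch: the mapping shows only a few of the abbreviations, and the unit-number reordering (e.g. "#183" => "No 183") is omitted.

```python
import re

# Matches the last word of a street name, e.g. "St" in "Main St"
street_type_re = re.compile(r'\b\S+\.?$', re.IGNORECASE)

# Full names we expect to see at the end of a street name (excerpt)
expected = ["Street", "Avenue", "Boulevard", "Drive", "Court", "Place",
            "Way", "Northeast", "Northwest", "Southeast", "Southwest"]

# Abbreviation -> full name (excerpt; the real mapping covers more cases)
mapping = {"St": "Street", "Ave": "Avenue", "AVE": "Avenue", "Pl": "Place",
           "NE": "Northeast", "SW": "Southwest"}

def audit_street_type(street_name):
    """Return the trailing word if it is not an expected full name, else None."""
    m = street_type_re.search(street_name)
    if m and m.group() not in expected:
        return m.group()
    return None

def update_name(name, mapping):
    """Expand every abbreviated word in a street name using the mapping."""
    return " ".join(mapping.get(w, w) for w in name.split())
```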
run step3_audit_streetname.py, get
124th Avenue Northeast #A => No A 124th Avenue Northeast
15TH Ave Northeast => 15TH Avenue Northeast
Juanita-Woodinville Way Northeast => Juanita Woodinville Way Northeast
Pinehurst Pocket Park;19th Avenue Northeast => Pinehurst Pocket Park 19th Avenue Northeast
15th Ave Northeast => 15th Avenue Northeast
Lake City Way Northeast #183 => No 183 Lake City Way Northeast
233rd St SW => 233rd Street Southwest
25th Ave NE => 25th Avenue Northeast
138th Way NE => 138th Way Northeast
18006 120TH AVE NE => 18006 120TH Avenue Northeast
104th Ave NE => 104th Avenue Northeast
141st Pl NE => 141st Place Northeast
Ballinger Way NE => Ballinger Way Northeast
Bothell Way NE => Bothell Way Northeast
120th Avenue NE => 120th Avenue Northeast
NE 158th St => Northeast 158th Street
NE 186th St => Northeast 186th Street
Main St => Main Street
NE Bothell Way => Northeast Bothell Way
Bothell-Everett Highway => Bothell Everett Highway
These are the names that were ultimately updated.
Finally, I updated all incorrect street names and saved the new file as 'kenmore_updated_street.xml'.
Auditing the new file again confirms that no street names still need updating. Street name auditing done.
I also found inconsistencies in the city names: some are misspelled and some are not capitalized. I updated the city names to fix these issues. The process is as follows:
1. Create a list of the correct city names in the map area.
2. Audit each "addr:city" value and list all city names found in the file.
3. Since there are only 13 distinct city names in total, it is easy to spot the misspellings and capitalization errors and update them to the correct name and format.
4. Save the new file with the city names updated.
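A sketch of the city-name fix. The misspellings come from the audit output ('Bothel', 'Woodenville') plus the capitalization variants ('BOTHELL', 'kenmore'); the expected list below is an assumed excerpt of the full one:

```python
# Misspelling -> correct spelling, taken from the audit output
corrections = {"Bothel": "Bothell", "Woodenville": "Woodinville"}

# Correctly spelled city names in the map area (assumed excerpt)
expected_cities = {"Bothell", "Edmonds", "Kenmore", "Kirkland",
                   "Lake Forest Park", "Mountlake Terrace", "Seattle",
                   "Shoreline", "Woodinville"}

def update_city(name):
    """Fix spelling and capitalization of an addr:city value."""
    # Normalize capitalization first (BOTHELL -> Bothell, kenmore -> Kenmore)
    fixed = name.title() if name.title() in expected_cities else name
    # Then correct known misspellings
    return corrections.get(fixed, fixed)
```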
run step4_audit_cityname.py, get
{'BOTHELL', 'Bothel', 'Bothell', 'Edmonds', 'Kenmore', 'Kirkland', 'Lake Forest Park', 'Mountlake Terrace', 'Seattle', 'Shoreline', 'Woodenville', 'Woodinville', 'kenmore'}
Pulling all city names in the file yields only 13 distinct values, so it is easy to audit them and make corrections with simple if-else logic.
Finally, I updated all incorrect city names and saved the new file as 'kenmore_updated_streetcity.xml'.
Auditing the new file again confirms that no city names still need updating. City name auditing done.
Next, I converted the cleaned XML into JSON documents for MongoDB. The process is as follows:
1. Process the two top-level tag types: node and way.
2. Turn all attributes of node and way into key/value pairs.
3. Ignore problematic tag keys.
4. Collect tag keys starting with "addr:" into an "address" dictionary.
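The conversion rules above can be sketched as a `shape_element` function. Field names such as "pos" and "node_refs" follow the Udacity course schema and are assumptions here; "created" and "address" are confirmed by the aggregation queries later on.

```python
import re

problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')
CREATED = ["version", "changeset", "timestamp", "user", "uid"]

def shape_element(element):
    """Turn a node/way XML element into a JSON-ready dict per the steps above."""
    if element.tag not in ("node", "way"):
        return None
    node = {"type": element.tag, "created": {}}
    for k, v in element.attrib.items():
        if k in CREATED:
            node["created"][k] = v
        elif k not in ("lat", "lon"):
            node[k] = v
    if "lat" in element.attrib and "lon" in element.attrib:
        node["pos"] = [float(element.attrib["lat"]), float(element.attrib["lon"])]
    address = {}
    for tag in element.iter("tag"):
        k, v = tag.attrib["k"], tag.attrib["v"]
        if problemchars.search(k):
            continue                     # step 3: skip problematic keys
        if k.startswith("addr:") and k.count(":") == 1:
            address[k[5:]] = v           # step 4: collect into "address"
        elif ":" not in k:
            node[k] = v
    if address:
        node["address"] = address
    node_refs = [nd.attrib["ref"] for nd in element.iter("nd")]
    if node_refs:
        node["node_refs"] = node_refs
    return node
```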
run step5_convert_xmltojson.py to perform the conversion
I used pymongo to create the requested database and the subprocess package to run shell commands that import the JSON file.
import re
from subprocess import check_output
# compare file sizes of the XML and the converted JSON
# (check_output returns bytes in Python 3, so decode before string handling)
osm = re.sub(r'\s+', ' ', check_output("du -h kenmore_updated_streetcity.xml".split()).decode())
json_size = re.sub(r'\s+', ' ', check_output("du -h kenmore_updated_streetcity.xml.json".split()).decode())
print(osm + '\n' + json_size)
import os
import signal
import subprocess
# start mongod in its own process group (setsid) so it can be stopped later
pro = subprocess.Popen("mongod", preexec_fn=os.setsid)
from pymongo import MongoClient
import bson
db_name = "udacity_proj"
client = MongoClient()
db = client[db_name]
collection = "osm_kenmore"
json_file = "kenmore_updated_streetcity.xml.json"
mongoimport_cmd = "mongoimport --db " + db_name + \
" --collection " + collection + \
" --file " + json_file
# drop the collection if it already exists
if collection in db.list_collection_names():
    print("dropping collection")
    db[collection].drop()
# execute the mongoimport command
print("Executing: " + mongoimport_cmd)
subprocess.call(mongoimport_cmd.split())
data_db = db[collection]
data_db.count_documents({})                # total documents
data_db.count_documents({"type": "node"})  # number of nodes
data_db.count_documents({"type": "way"})   # number of ways
len(data_db.distinct("created.user"))      # number of unique contributors
for doc in data_db.aggregate([
{"$match": {'created.user': {"$exists": 1} }},
{"$group": {"_id": '$created.user', "count": {"$sum": 1}}},
{"$sort": {'count': -1}},
{"$limit": 10}
]):
    print(doc)
for doc in data_db.aggregate([
{"$match": {'amenity': {"$exists": 1} }},
{"$group": {"_id": '$amenity', "count": {"$sum": 1}}},
{"$sort": {'count': -1}},
{"$limit": 10}
]):
    print(doc)
for doc in data_db.aggregate([
{"$match": {'address.street': {"$exists": 1} }},
{"$group": {"_id": '$address.street', "count": {"$sum": 1}}},
{"$sort": {'count': -1}},
{"$limit": 1}
]):
    print(doc)
Besides the statistics above, I'm also interested in verifying which fast food chain is the most popular in this area. From memory, Subway should have the most stores. I also want to see whether Starbucks and Bank of America both have more stores than their competitors here, just like in downtown Bellevue.
Through a few aggregations, I found:
1. Subway has 8 stores, more than any competitor.
2. Starbucks is dominant in the Kenmore area.
3. Bank of America has 6 stores in this area, the same as its biggest competitor, Chase.
These findings confirm my assumptions, which is interesting.
for doc in data_db.aggregate([
{"$match": {'amenity': 'fast_food' }},
{"$group": {"_id": '$name', "count": {"$sum": 1}}},
{"$sort": {'count': -1}},
{"$limit": 5}
]):
    print(doc)
for doc in data_db.aggregate([
{"$match": {'amenity': 'cafe' }},
{"$group": {"_id": '$name', "count": {"$sum": 1}}},
{"$sort": {'count': -1}},
{"$limit": 5}
]):
    print(doc)
for doc in data_db.aggregate([
{"$match": {'amenity': 'bank' }},
{"$group": {"_id": '$name', "count": {"$sum": 1}}},
{"$sort": {'count': -1}},
{"$limit": 5}
]):
    print(doc)
The map data for the Kenmore area is of good quality in general. However, there were some inconsistencies in the street names, where some used abbreviations, and there were incorrect spellings in the city names. All of these issues were found by auditing and finally fixed by replacing them with the correct formats. This project showed me the importance of data quality and taught me how to QA data efficiently.